The goal of this project is to use R and Tableau dashboard to display the New York City flight dataset and acquire insights on how to reduce departure delay, one of the most addressed issues encountered by all stakeholders involved in the aviation supply chain. John F. Kennedy International Airport (JFK), Newark Liberty International Airport (EWR), and LaGuardia Airport (LGA) in New York City are all incredibly crowded. PANYNJ got in touch with us to see whether we could comprehend the issue and resolve it. We think that departure delays are affected by the four most important elements - weather, flight patterns, carrier and manufacturer. In order to quantify, we select several departure delay indicators to display, such as average departure delay time (in minutes), departure delay occurrences, and departure delay percentage.
To better understand the connection between weather, manufacturing and flight patterns and carrier, our business analytics team has posed the three questions listed below.
Loading the libraries:
library(tidyverse)
library(dplyr)
library(plotly)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(corrplot)
library(xray)
Reading the files:
airlines <- read.csv("airlines.csv")
airports <- read.csv("airports.csv")
flights <- read.csv("flights.csv")
planes <- read.csv("planes.csv")
weather <- read.csv("weather.csv")
Setting up data frames to use later:
allflights <- flights %>%
select(carrier,hour,origin,dep_delay,time_hour,month)
flights_planes <- left_join(flights,planes,by = "tailnum")
Which carrier has the most flights?
allflights %>%
group_by(carrier) %>%
summarise(count=n()) %>%
ggplot(aes(y=count,x=reorder(carrier,-count))) +
geom_bar(fill = "blue",color=" gray",stat="identity") +
ylab("Number of Flights") +
xlab("Carriers") +
ggtitle("Total Number of Flights by Carriers ") +
theme(plot.title = element_text(hjust = 0.5,size=15))
By looking at the graph, we can see that UA has the most number of flights.
Filtering the top ten airlines to get unbiased data:
If we take the airlines with less sample size i.e. less number of flights, the results would be indecisive. Therefore, we will have to consider the top ten airlines as they have the significant number of flights.
allflights %>%
filter(carrier %in% c('UA','EV','B6','DL','AA','MQ','US','9E','WN','VX')) %>%
ggplot( aes(y = dep_delay, x = reorder(carrier, dep_delay, na.rm=TRUE, FUN="median"))) + geom_boxplot() +
ylab("Departure Delay")+
xlab("Carriers") +
ggtitle("Departure Delay by Carriers ") +
theme(plot.title = element_text(hjust = 0.5,size=15))
It is observed that DL, AA and MQ have large number of flights having delay greater than 700 min.
Average Departure Delay:
allflights %>%
filter(carrier %in% c('UA','EV','B6','DL','AA','MQ','US','9E','WN','VX'))%>%
filter(dep_delay>0) %>%
group_by(carrier) %>%
summarise(average_departure_delay= mean(dep_delay,na.rm=TRUE),count=n()) %>%
ggplot( aes(y = average_departure_delay, x = reorder(carrier,-count))) +
geom_point(size=2,color="blue") +
xlab("Carriers") +
ylab("Departure Delay in min") +
ggtitle("Average Departure Delay(in min) by Top Ten Carriers") +
theme(plot.title = element_text(hjust = 0.5,size=15))
The graph shows that despite being the busiest carrier, UA has the shortest departure delay.
Setting up the data frames for manufacturer analysis:
# All manufacturers
total_manu <- flights_planes %>%
filter(!is.na(manufacturer)) %>%
group_by(manufacturer) %>%
summarise(total_count=n(),average_dep_delay = mean(dep_delay, na.rm =TRUE)) %>%
arrange(desc(total_count))
#Delayed Manufacturers
delay_manu <- flights_planes %>%
filter(dep_delay>0) %>%
filter(!is.na(manufacturer)) %>%
group_by(manufacturer) %>%
summarise(total_delay_count=n(),average_dep_delay = mean(dep_delay, na.rm =TRUE)) %>%
arrange(desc(total_delay_count))
#Left Join between Total Manufacturers and Delayed Manufacturers
summary_manu <- left_join(total_manu,delay_manu, by="manufacturer") %>%
mutate(percent_delay=100*total_delay_count/total_count )
total_manu %>%
ggplot(aes(x = total_count, y = reorder(manufacturer, total_count), fill=manufacturer )) +
geom_bar(width=0.7, stat = "identity", fill="blue",color="gray") +
theme_bw(base_line_size = 0, base_size = 7) +
xlab("Number of Flights") +
ylab("Manufacturers") +
ggtitle("Manufacturers by Total Number of Flights") +
theme(plot.title = element_text(hjust = 0.3,size=15))
As we can see from the graph above, only top five manufacturers have significant flights. If we consider all the manufacturers in our analysis, results would be inconclusive as the sample size for other manufacturers(other than top five) is remarkably less. Therefore, we will only consider top five manufacturers i.e. Boeing, Embraer, Airbus Industrie, Bombardier Inc and Mcdonnell Douglas Aircraft Co.
summary_manu %>%
filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC','CANADAIR')) %>%
ggplot(aes(y = percent_delay, x = reorder(manufacturer,total_count), fill=manufacturer))+
geom_bar(width=0.7, stat = "identity",fill="blue",colour="gray") +
theme_bw(base_line_size = 0, base_size = 7) +
xlab("Manufacturers") +
ylab("Percentage Delay") +
ggtitle("Manufacturers by Percentage Delay") +
theme(plot.title = element_text(hjust = 0.5,size=15))
total_manu %>%
filter(manufacturer %in% c('BOEING','EMBRAER','AIRBUS','AIRBUS INDUSTRIE','BOMBARDIER INC' ,'CANADAIR'))%>%
ggplot(aes(y = average_dep_delay,x = reorder(manufacturer,total_count),fill=manufacturer))+
geom_bar(width=0.7, stat = "identity",fill="blue",colour="gray") +
theme_bw(base_line_size = 0, base_size = 7) +
xlab("Manufacturers") +
ylab("Average Departure Delay") +
ggtitle("Manufacturers by Average Departure Delay") +
theme(plot.title = element_text(hjust = 0.5,size=15))
From the above two graphs, we can see that Airbus Industrie has the least departure delay.
Summarize the data to get to know more about it.
summary(airlines)
## carrier name
## Length:16 Length:16
## Class :character Class :character
## Mode :character Mode :character
summary(airports)
## faa name lat lon
## Length:1458 Length:1458 Min. :19.72 Min. :-176.65
## Class :character Class :character 1st Qu.:34.26 1st Qu.:-119.19
## Mode :character Mode :character Median :40.09 Median : -94.66
## Mean :41.65 Mean :-103.39
## 3rd Qu.:45.07 3rd Qu.: -82.52
## Max. :72.27 Max. : 174.11
## alt tz dst tzone
## Min. : -54.00 Min. :-10.000 Length:1458 Length:1458
## 1st Qu.: 70.25 1st Qu.: -8.000 Class :character Class :character
## Median : 473.00 Median : -6.000 Mode :character Mode :character
## Mean :1001.42 Mean : -6.519
## 3rd Qu.:1062.50 3rd Qu.: -5.000
## Max. :9078.00 Max. : 8.000
summary(flights)
## ID year month day
## Min. : 1 Min. :2013 Min. : 1.000 Min. : 1.00
## 1st Qu.: 81837 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Median :163674 Median :2013 Median : 7.000 Median :16.00
## Mean :163674 Mean :2013 Mean : 6.565 Mean :15.74
## 3rd Qu.:245510 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :327346 Max. :2013 Max. :12.000 Max. :31.00
## dep_time sched_dep_time dep_delay arr_time sched_arr_time
## Min. : 1 Min. : 500 Min. : -43.00 Min. : 1 Min. : 1
## 1st Qu.: 907 1st Qu.: 905 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1122
## Median :1400 Median :1355 Median : -2.00 Median :1535 Median :1554
## Mean :1349 Mean :1340 Mean : 12.56 Mean :1502 Mean :1533
## 3rd Qu.:1744 3rd Qu.:1729 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1944
## Max. :2400 Max. :2359 Max. :1301.00 Max. :2400 Max. :2359
## arr_delay carrier flight tailnum
## Min. : -86.000 Length:327346 Min. : 1 Length:327346
## 1st Qu.: -17.000 Class :character 1st Qu.: 544 Class :character
## Median : -5.000 Mode :character Median :1467 Mode :character
## Mean : 6.895 Mean :1943
## 3rd Qu.: 14.000 3rd Qu.:3412
## Max. :1272.000 Max. :8500
## origin dest air_time distance
## Length:327346 Length:327346 Min. : 20.0 Min. : 80
## Class :character Class :character 1st Qu.: 82.0 1st Qu.: 509
## Mode :character Mode :character Median :129.0 Median : 888
## Mean :150.7 Mean :1048
## 3rd Qu.:192.0 3rd Qu.:1389
## Max. :695.0 Max. :4983
## hour minute time_hour
## Min. : 5.00 Min. : 0.00 Length:327346
## 1st Qu.: 9.00 1st Qu.: 8.00 Class :character
## Median :13.00 Median :29.00 Mode :character
## Mean :13.14 Mean :26.23
## 3rd Qu.:17.00 3rd Qu.:44.00
## Max. :23.00 Max. :59.00
summary(planes)
## tailnum year type manufacturer
## Length:3322 Min. :1956 Length:3322 Length:3322
## Class :character 1st Qu.:1997 Class :character Class :character
## Mode :character Median :2001 Mode :character Mode :character
## Mean :2000
## 3rd Qu.:2005
## Max. :2013
## NA's :70
## model engines seats speed
## Length:3322 Min. :1.000 Min. : 2.0 Min. : 90.0
## Class :character 1st Qu.:2.000 1st Qu.:140.0 1st Qu.:107.5
## Mode :character Median :2.000 Median :149.0 Median :162.0
## Mean :1.995 Mean :154.3 Mean :236.8
## 3rd Qu.:2.000 3rd Qu.:182.0 3rd Qu.:432.0
## Max. :4.000 Max. :450.0 Max. :432.0
## NA's :3299
## engine
## Length:3322
## Class :character
## Mode :character
##
##
##
##
summary(weather)
## origin year month day
## Length:26115 Min. :2013 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00
## Mode :character Median :2013 Median : 7.000 Median :16.00
## Mean :2013 Mean : 6.504 Mean :15.68
## 3rd Qu.:2013 3rd Qu.: 9.000 3rd Qu.:23.00
## Max. :2013 Max. :12.000 Max. :31.00
##
## hour temp dewp humid
## Min. : 0.00 Min. : 10.94 Min. :-9.94 Min. : 12.74
## 1st Qu.: 6.00 1st Qu.: 39.92 1st Qu.:26.06 1st Qu.: 47.05
## Median :11.00 Median : 55.40 Median :42.08 Median : 61.79
## Mean :11.49 Mean : 55.26 Mean :41.44 Mean : 62.53
## 3rd Qu.:17.00 3rd Qu.: 69.98 3rd Qu.:57.92 3rd Qu.: 78.79
## Max. :23.00 Max. :100.04 Max. :78.08 Max. :100.00
## NA's :1 NA's :1 NA's :1
## wind_dir wind_speed wind_gust precip
## Min. : 0.0 Min. : 0.000 Min. :16.11 Min. :0.000000
## 1st Qu.:120.0 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.000000
## Median :220.0 Median : 10.357 Median :24.17 Median :0.000000
## Mean :199.8 Mean : 10.518 Mean :25.49 Mean :0.004469
## 3rd Qu.:290.0 3rd Qu.: 13.809 3rd Qu.:28.77 3rd Qu.:0.000000
## Max. :360.0 Max. :1048.361 Max. :66.75 Max. :1.210000
## NA's :460 NA's :4 NA's :20778
## pressure visib time_hour
## Min. : 983.8 Min. : 0.000 Length:26115
## 1st Qu.:1012.9 1st Qu.:10.000 Class :character
## Median :1017.6 Median :10.000 Mode :character
## Mean :1017.9 Mean : 9.255
## 3rd Qu.:1023.0 3rd Qu.:10.000
## Max. :1042.1 Max. :10.000
## NA's :2729
Clearing the data:
Remove the cancelled flights and flights that were early. Calculate the average departure delay
#remove all **NA** values from flights dataset.
not_cancelled <- flights %>%
filter(!is.na(arr_delay), !is.na(dep_delay))
not_cancelled
delayed_flights_summary <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
count = n(),
avg_dep_delay = mean(dep_delay)
)
delayed_flights_summary
Plotting a graph between the number of flights and the average departure delay.
ggplot(delayed_flights_summary, aes(avg_dep_delay, count)) +
geom_point() +
scale_y_continuous(breaks= seq(0,600,length.out=7))+
theme(plot.title = element_text(hjust = 0.5))+
geom_vline(xintercept = mean(delayed_flights_summary$avg_dep_delay), color = "red")+
labs(title = "Average Delay Vs No. of flights", x = "Avgerage Departure Delays(in minutes)", y = "Number of flights")
It is noticed that Most of the flights have delay below 50 minutes Max delay is 300 minutes. Mean of departure delay is around 13 minutes
Sorting the data into different parts of the day and removing the less useful data.
flight_data <- flights %>% drop_na(dep_delay)
flight_data <- flights %>% drop_na(sched_dep_time)
flight_data<-mutate(flights, timeOfDay = ifelse(sched_dep_time %in% 600:1159, "Morning",
ifelse(sched_dep_time %in% 1200:1759, "Afternoon",
ifelse(sched_dep_time %in% 1800:2359, "Evening","Night"))))
Plotting a box-plot to get to know how the delay is in different parts of the day.
ggplot(data = flight_data, aes(x = dep_delay, y = c(timeOfDay))) +
geom_boxplot(color = "blue")+
scale_x_continuous(breaks= seq(0,1300,length.out=14))+
labs(y = "Time of a Day", x = "Departure delay", title = "Distribution of Departure Delay during a day")+
theme(plot.title = element_text(hjust = 0.5))
Evenings seem to have most Departure Delay In all the other times of the day, most of the flights have least delay.(Close to zero)
Splitting the data into different airports in New York to see if there is any pattern.
EWR_data<- flights %>% filter(flights$origin == "EWR" & flights$dep_delay>0)
LGA_data<- flights %>% filter(flights$origin == "LGA" & flights$dep_delay>0)
JFK_data<- flights %>% filter(flights$origin == "JFK" & flights$dep_delay>0)
EWR_data<-mutate(EWR_data, Month=month.name[EWR_data$month] )
LGA_data<-mutate(LGA_data, Month=month.name[LGA_data$month] )
JFK_data<-mutate(JFK_data, Month=month.name[JFK_data$month] )
flight_data<-mutate(flight_data, Month=month.name[flight_data$month] )
A Box-plot per month.
For all flights:
ggplot (data = flight_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) +
geom_boxplot(fill = "orange") +
ylim(0,900)+
labs(x = "Month", y = "Departure Delays(in minutes)", title = "Delay distribution during each month")+
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 183142 rows containing non-finite values (stat_boxplot).
At EWR Airport:
ggplot (data = EWR_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) +
geom_boxplot(fill = "orange") +
ylim(0,900)+
labs(x = "Month", y = "Departure Delays(in minutes)", title = "At EWR - Delay distribution during each month")+
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
At LGA Airport:
ggplot (data = LGA_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) +
geom_boxplot(fill = "orange") +
ylim(0,900)+
labs(x = "Month", y = "Departure Delays(in minutes)", title = "At LGA - Delay distribution during each month")+
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
At JFK Airport:
ggplot (data = JFK_data, mapping = aes(x = factor(Month, levels = month.name), y = dep_delay)) +
geom_boxplot(fill = "orange") +
ylim(0,900)+
labs(x = "Month", y = "Departure Delays(in minutes)", title = "At JFK - Delay distribution during each month")+
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
June and July have the maximum delays in all the airports.
Dividing the data into seasons:
flight_data <- flight_data %>%
mutate(season = ifelse(month %in% 9:11, "Fall",
ifelse(month %in% 6:8, "Summer",
ifelse(month %in% 3:5, "Spring",
"Winter"))))
planes_with_delays <- flight_data %>%
filter(dep_delay>0) %>%
select(carrier,hour, origin,month, season, dep_delay,time_hour,tailnum, distance,timeOfDay)
Plotting the seasonal effect on delay
ggplot(data=planes_with_delays, aes(x=season, y=dep_delay, fill= timeOfDay))+ geom_col()+
labs(title = "Delay per season",x=" Season", y="Delay") +
scale_fill_brewer(palette="Spectral")+
theme(plot.title = element_text(hjust = 0.5))
In this part, we try to analyze relationship between weather factors (humidity, visibility, precipitation, and wind speed) and departure delay (average delay and delay percent).
Firstly, we try to see if there is null value in weather dataset
anomalies(read_csv('weather.csv'))
df are created as a new dataframe by joining flights and weather dataset and selecting a number of parameters that related to this analysis .For some variables which are wind_dir, temp, humid, and wind_speed, there are little percent of null values (< 5%). Consequently, null values from these columns are exclude. Meanwhile, wind_gust has 79.56% of null values from total data which means that there are not enough data in this column, so we exclude it as well. We also create new column called ‘delay’ which indicates whether each flight is delay or not delay. If departure delay more than 0, we define it as 1 (delay flight). Otherwise, it is 0 (not delay flight). Inner-join is executed to flights and weather data by time hour and origin to create df.
flights <-read_csv('flights.csv') %>%
select(carrier,hour, origin,month, dep_delay,time_hour) %>%
mutate(delay = case_when(dep_delay>0 ~ 1,
dep_delay<=0 ~ 0))
weather <-read_csv('weather.csv') %>%
filter(!is.na(temp),!is.na(humid),!is.na(wind_dir),!is.na(wind_speed))%>%
select(temp,wind_speed,precip,time_hour,origin,
visib,dewp,humid,wind_dir,pressure)
df <- inner_join(flights,weather,by=c('time_hour','origin'))
anomalies(df)
For pressure, null values are replaced by the mean of total pressure.
df$pressure[is.na(df$pressure)] <- mean(df$pressure,na.rm=TRUE)
anomalies(df)
After that, correlation matrix are plotted to see if there is any correlation between departure delay and weather factors. As you can see, dewpoint variable is highly correlated with temperature variable (89%). For this reason, dewp is excluded.
library(ggcorrplot)
delay_corr_plot <-
df%>%
filter(delay == 1) %>%
select(dep_delay,temp,wind_speed,precip,visib,dewp,humid,wind_dir,pressure)%>%
na.omit
delay_cor <- cor(delay_corr_plot)
ggcorrplot(delay_cor, hc.order = TRUE, type = "lower",
lab = TRUE, outline.col = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"), lab_size=3, tl.cex = 10)
Multiple linear regression model is appply to weather data to inspect variable that is not significant when trying to predict departure delay by using a constraint that p-value of each parameter has to be less than 0.05. From the result, wind_dir is exclude as well.
fit <-glm(dep_delay ~., data =delay_corr_plot )
summary(fit)
##
## Call:
## glm(formula = dep_delay ~ ., data = delay_corr_plot)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -94.58 -30.64 -17.50 11.06 1268.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 434.402083 25.232017 17.216 < 2e-16 ***
## temp 0.722589 0.085760 8.426 < 2e-16 ***
## wind_speed 0.466188 0.029497 15.804 < 2e-16 ***
## precip 41.276196 4.367207 9.451 < 2e-16 ***
## visib -0.367232 0.096624 -3.801 0.000144 ***
## dewp -0.587777 0.091906 -6.395 1.61e-10 ***
## humid 0.589664 0.044960 13.115 < 2e-16 ***
## wind_dir -0.003123 0.001716 -1.820 0.068709 .
## pressure -0.441851 0.024148 -18.298 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2840.981)
##
## Null deviance: 363813974 on 123924 degrees of freedom
## Residual deviance: 352043033 on 123916 degrees of freedom
## AIC: 1337135
##
## Number of Fisher Scoring iterations: 2
Apart from above analysis, we assume that temperature does not affect delay directly so that there will be 5 factors taking into consideration which are humidity, visibility, wind speed, precipitation, and pressure.
First, monthly trend of the factor is plotted to analyze seasonal trend.
df%>%
group_by( month)%>%
summarise(average_humidity = mean(humid))%>%
ggplot(aes(x=month, y=average_humidity))+
geom_line()+
scale_x_continuous(breaks = seq(1,12,by=1))+
labs(title = "Monthly Trend of Humidity",
y = "Relative Humidity (%)",
x = "Month")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))
There are high level of humidity in May to July and it drastically decreased in April and November. However, there is no apparent seasonal trend.
To analyze the relationship between average departure delay and humidity, scatter plot is generated to observe the trend. X-axis demonstrates relative humidity which is cut into sections of 5. Average delay is based on y-axis while colour of each plot represents number of flights in that level of humidity.
humidity <- df %>%
group_by(humid = cut(humid,breaks= seq(0,100, by =10))) %>%
summarise(
density = n(),
average_delay = mean(dep_delay),
delay_percent = mean(delay)*100)
humidity %>%
ggplot(aes(x=humid, y= average_delay)) +geom_point(aes(color=density),size=5) +ylab("Average Delay (min)")+ggtitle("Average delay vs humid with density plot")+
scale_color_gradient(low = 'blue',high = 'red')+xlab('Relative Humidity (%)')
From above graph, an increasing average delay trend can be seen as
relative humidity is increasing. Most flights have level of humidity at
35 - 50 percent whereas the delay is approximately 10 minutes.
humidity %>%
ggplot(aes(x=humid, y= delay_percent,fill=density)) +geom_col() +ylab("Delay percent")+ggtitle("Delay percent vs humidity")+
scale_fill_gradient(low = 'blue',high = 'red')+xlab('Relative Humidity (%)')
For delay percent, there is an increasing trend compared to humidity
when looking at the bar chart. y-axis represents delay percent while
colour of the bar graphs display the number of flights for each relative
humidity range.
Monthly trend of visibility is plotted
df%>%
group_by( month)%>%
summarise(average_visibility = mean(visib))%>%
ggplot(aes(x=month, y=average_visibility))+
geom_line()+
scale_x_continuous(breaks = seq(1,12,by=1))+
labs(title = "Monthly Trend of Visibility",
y = "Visibility (miles)",
x = "Month")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))
There are high visibility for most time of the year except in January,
Mar, and December. There is no apparent seasonal trend.
Two graph is plot which are relationship between visibility and average delay, delay percent. In these two plots. scatter plot is used to analyzed the trend with trend line in order to make it easier to find overall trend. y-axis presents delay percent and average delay while x-axis presents visibility in mile.
visibility <- df %>%
group_by(visib) %>%
summarise(
density = n(),
average_delay = mean(dep_delay),
delay_percent = mean(delay)*100)
visibility %>% ggplot(aes(x=visib, y=delay_percent)) +geom_point()+geom_smooth(method = 'lm')+ggtitle("Average delay vs visibility")+ylab("Average delay") +xlab('Visibility (mile)')
visibility %>% ggplot(aes(x=visib, y=average_delay)) +geom_point()+geom_smooth()+ggtitle("Delay Percent vs visibility")+ylab("Delay percent")+xlab('Visibility (mile)')
From the graphs, we can see the decreasing trends in average delay and
delay percent compared to visibility which is make sense. If the
visibility is low, there should be more delay since it is harder to
control the plane. For delay percent, it drastically decrease in 0-2.5
visibility range. It might be the range that visiblity hugely affects
the delay.
Monthly trend of wind speed is plotted. Wind speed is highest in March and lowest in August.
df%>%
group_by( month)%>%
summarise(average_windspeed = mean(wind_speed))%>%
ggplot(aes(x=month, y=average_windspeed))+
geom_line()+
scale_x_continuous(breaks = seq(1,12,by=1))+
labs(title = "Monthly Trend of Wind Speed",
y = "Average Speed (mph)",
x = "Month")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))
For wind speed, scatter plots are demonstrated which is the same as visibility graphs while y-axis displays the delays and x-axis demonstrates wind speed in mile per hour. However, when compared to visibility, opposite trend is given. Average delay and delay percent tend to increase as the wind speed increases from 0-25 mph. After 25 mph, there are fluctuation of delays which is hard to indicate what trend it is.
wind_speed <- df %>%
group_by(wind_speed) %>%
summarise(
density = n(),
average_delay = mean(dep_delay),
delay_percent = mean(delay)*100)
wind_speed %>%
ggplot(aes(x=wind_speed, y=average_delay)) +geom_point()+geom_smooth(method = 'lm')+ggtitle("Average delay vs Wind Speed (mph)")+ylab("Average Delay (minute)")+xlab('Wind Speed')
wind_speed %>%
ggplot(aes(x=wind_speed, y=delay_percent)) +geom_point()+geom_smooth(method = 'lm')+ggtitle("Delay Percent vs wind Speed (mph)")+ylab("Delay Percent") +xlab('Wind Speed')
Because of the fluctuation of delay, we create a histogram to investigate on number of flights and wind speed.
df %>% ggplot(aes(x=wind_speed)) + geom_histogram(color="darkblue", fill="lightblue") + ggtitle("Number of flights vs Wind speed") + ylab("Number of flights")+
xlab('Wind Speed')
It can be seen that there are not many flights experiencing wind speed
more than 20 mile per hour which is the range of delay fluctuation. This
means that we do not have enough information to analyze overall trend
that wind speed is more than 20 so it fluctuates. For the visibility,
most of the flights have high visilibity, consequently, there are
fluctuation of data in low visibility range.
Monthy trend of precipitation is plotted.
df%>%
group_by(month)%>%
summarise(average_precipitation = mean(precip))%>%
ggplot(aes(x=month, y=average_precipitation))+
geom_line()+
scale_x_continuous(breaks = seq(1,12,by=1))+
labs(title = "Monthly Trend of Precipitation",
y = "Precipitation (inch)",
x = "Month")+
theme_bw()+
theme(plot.title = element_text(hjust = 0.5))
From the graph, precipitation is the highest at June which is reasonable since humidity is high in June as well. From the analysis in first part, there are many delays in June. For this reason, precipitation and humidity may affect delay of flights directly.
Focusing on the relationship with precipitation, there are an small increasing trend for average delay and delay percent from precipitation between 0 - 0.5 inch. Beyond that, there are fluctuation in data but the trendlines of both graphs still rise as an upward trend.
precipitation <- df %>%
group_by(precip) %>%
summarise(
density = n(),
average_delay = mean(dep_delay),
delay_percent = mean(delay)*100)
precipitation %>% ggplot( aes(x = precip, y = average_delay)) +
labs(x = "Precipitation (inch)", y = "Average Delay (min)") +
stat_smooth() +
geom_point(color = "gray40", alpha = .5) + ggtitle('Average Delay vs Precipitation')
precipitation %>% ggplot( aes(x = precip, y = delay_percent)) +
labs(x = "Precipitation (inch)", y = "Delay Percent") +
stat_smooth() +
geom_point(color = "gray40", alpha = .5)+ggtitle('Delay Percent vs Precipitation')
We create scatterplot between pressure and delay which has x-axis as pressure and y-axis as average delay and delay percent. The trends are display as same as the trends in percipitation analysis. It is overall upward trend with a fluctuation when pressure in more than 0.5 mbars.
pressure <- df %>%
group_by(pressure) %>%
summarise(
density = n(),
average_delay = mean(dep_delay),
delay_percent = mean(delay)*100)
precipitation %>% ggplot( aes(x = precip, y = average_delay)) +
labs(x = "Pressure (millibars)", y = "Average Delay (min)") +
stat_smooth() +
geom_point(color = "gray40", alpha = .5) + ggtitle('Average Delay vs Pressure')
precipitation %>% ggplot( aes(x = precip, y = delay_percent)) +
labs(x = "Pressure(millibars)", y = "Delay Percent") +
stat_smooth() +
geom_point(color = "gray40", alpha = .5)+ ggtitle('Delay Percent vs Pressure')
Although having a significant number of flights, Carrier UA has the least departure delay and carriers DL, AA and MQ have the more departure delays. Carrier-specific delays happen because of Air Traffic Control issues, Passenger Issues, Crew Issues or maybe maintenance issues such as the mechanic have noticed some unexpected concern which could increase the departure delay time.
Regarding the manufacturer analysis, Airbus Industrie has the least departure delay. This could be due to many factors such as specifications of engine, structure of the plane, etc.
From the above graphs, it is noticed that departure delays in the given data of flight information in 2013 occur more during spring and summer, especially during the months of June and July (Yahoo news,2013). As we know spring and summer seasons are prone to thunderstorms which cause a lot of issues with visibility, wind speed and taking of flights. Also from the report given by the US Department of transportation, We know that 7.39% of the delay was caused by aviation system delay in June whereas it was 6.16% in July.10.43% of the delay was caused by the late arrival of aircraft in June, while it was 9.01% in July. Some of the delays were caused by maintenance and crew problems, 7.06% from June and 6.59% from July. Out of all the late flights,41.80% of June are caused by extreme weather or National aviation system delay and 37.91% of July (US Department of transportation, 2014).
Also, In the time of the day graph, we notice that evenings have the most delay when compared to the rest of the day. Delay in the evenings is usually caused by a few factors. Which are Air traffic, Air traffic control, maintenance, baggage loading and also fuel refilling. In the evening, it is said that a lot of the delay is also caused due to the shift change of employees. And also, the Delay of one flight has led to delays in the next flights (Forbes,2019).
From weather analysis, there are apparent effects to departure delay by weather factors. As for humidity, wind speed, precipitation, and pressure, they tend to make more departure delays if their value increase. In the other hand, high visibility tend to lessen flight delay. Especially for precipitation and humidity, as they have their highest value and in June, and flight delay are increasing during that month as well. So these two factors could affect departure delay directly.
We are aware that delays brought on by bad or extreme weather are inevitable. However, the other elements, such as air traffic, maintenance, luggage loading, fuel replenishment, and delays brought on by staff members changing shifts, are reducible. Airlines can enhance the system to keep flights and luggage loading running smoothly, as well as properly changing personnel shifts so that the flights and airport are not disrupted. To pinpoint accurately which element is causing the delay, we need further information.
We split the workings of dashboard
Siwach : Departure airport, Day, Month, Weather, Number of flights and delays
Suraj: Carrier graph and information about the graphs.
Harshini: Average delay of each destination
We split our EDA equally into three questions.
Siwach: Answer for the question ‘How weather affects departure delay?’
Suraj: Answer for the question ‘Is there any carrier or manufacturer that performs significantly better than the rest?’
Harshini: Answer for the question ‘Is there any pattern in which delay occurs?’
1.Forbes, 2019. Most Flight Delays? In The Evening And Unrelated To Bad Weather [Online]. Available from: https://www.forbes.com/sites/garystoller/2019/06/03/most-flight-delays-in-the-evening-and-unrelated-to-bad-weather/ [Accessed on 06 December 2022].
2.US Department of transportation,2014. July Airline On-Time Performance Up, Cancellations Down From Previous Year, June [Online]. Available from: https://www.transportation.gov/briefing-room/july-airline-time-performance-cancellations-down-previous-year-june [Accessed on 06 December 2022].
3.Yahoo news,2013. Flight delays pile up amid FAA budget cuts [Online].Available from: https://news.yahoo.com/flight-delays-pile-amid-faa-budget-cuts-222819836--finance.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAAI5gJw7TsvbMWk7GfF2UTv8tzkyZODYE0JiPPIDeu78B_R3IrEcQ_pYl9C77FVIP3t9ufAVm779rQKzwBcXjRGMryD3h7qTUzvxdGtDyNUBBRWewBpM_Q1ZDh2hCmlUZ1l18AywWXxqThFYKAvq9Zi_oF2l-GDVFr1rH4S1Kuuza [Accessed on 06 December 2022].